Qubrid AI
NVIDIA · Chat / LLM · 120B Parameters (12B Active) · 256K Context (up to 1M)
Qubrid Playground · License · HuggingFace
Function Calling · Streaming · Reasoning · Agent Workflows · Long Context · Code · Tool Use

Overview

NVIDIA Nemotron-3 Super 120B A12B FP8 is an open-weight LLM built for agentic reasoning and high-volume enterprise workloads. Using a hybrid LatentMoE architecture (Mamba-2 + MoE + Attention) with Multi-Token Prediction (MTP) and native NVFP4 pretraining on 25T tokens, it delivers up to 2.2x higher throughput than GPT-OSS-120B and 7.5x higher than Qwen3.5-122B. With a 256K-token context window (extendable to 1M), a configurable thinking mode, and 60.47% on SWE-Bench Verified, it is purpose-built for collaborative agents, long-context reasoning, and IT automation across 7 languages, served instantly via the Qubrid AI Serverless API.
2.2x throughput vs GPT-OSS-120B. 1M token context. 512 experts, 22 active per token. Deploy on Qubrid AI — no H100 cluster required.

Model Specifications

| Field | Details |
| --- | --- |
| Model ID | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 |
| Provider | NVIDIA |
| Kind | Chat / LLM |
| Architecture | LatentMoE — Mamba-2 + MoE + Attention hybrid with MTP; 512 experts, 22 active per token; 120B total / 12B active |
| Parameters | 120B total (12B active per inference pass) |
| Context Length | 256K tokens (up to 1M) |
| MoE | Yes |
| Release Date | March 11, 2026 |
| License | NVIDIA Nemotron Open Model License |
| Training Data | 25T-token corpus (NVFP4 native pretraining): web, code, math, science, multilingual; post-training cutoff February 2026; pre-training cutoff June 2025 |
| Function Calling | Supported |
| Image Support | N/A |
| Serverless API | Available |
| Fine-tuning | Coming Soon |
| On-demand | Coming Soon |
| State | 🟢 Ready |

Pricing

💳 Access via the Qubrid AI Serverless API with pay-per-token pricing. No infrastructure management required.
| Token Type | Price per 1M Tokens |
| --- | --- |
| Input Tokens | $0.10 |
| Input Tokens (Cached) | $0.04 |
| Output Tokens | $0.50 |
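Per-request cost follows directly from the table above. The sketch below is a rough back-of-the-envelope estimator (the `request_cost` helper is illustrative, not part of any Qubrid SDK; actual billing is determined by the platform):

```python
# Prices per 1M tokens, taken from the pricing table above
PRICE_INPUT = 0.10
PRICE_INPUT_CACHED = 0.04
PRICE_OUTPUT = 0.50

def request_cost(input_tokens, output_tokens, cached_tokens=0):
    """Estimate the USD cost of a single request from token counts."""
    uncached = input_tokens - cached_tokens
    return (
        uncached * PRICE_INPUT
        + cached_tokens * PRICE_INPUT_CACHED
        + output_tokens * PRICE_OUTPUT
    ) / 1_000_000

# Example: 200K-token context, half of it cache-hit, 4K tokens generated
print(f"${request_cost(200_000, 4_000, cached_tokens=100_000):.4f}")
# → $0.0160
```

Note how cached input cuts the context cost by more than half, which is why cache-friendly prompt layouts matter for long-context and RAG workloads.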

Quickstart

Prerequisites

  1. Create a free account at platform.qubrid.com
  2. Generate your API key from the API Keys section
  3. Replace QUBRID_API_KEY in the code below with your actual key
💡 Temperature & Top P: Use temperature=1 and top_p=0.95 — recommended for all tasks with this model.

Python

from openai import OpenAI

# Initialize the OpenAI client with Qubrid base URL
client = OpenAI(
    base_url="https://platform.qubrid.com/v1",
    api_key="QUBRID_API_KEY",
)

# Create a streaming chat completion
stream = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
    messages=[
      {
        "role": "user",
        "content": "Explain quantum computing in simple terms"
      }
    ],
    max_tokens=16000,
    temperature=1,
    top_p=0.95,
    stream=True
)

# Stream the response chunks as they arrive
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print("\n")

# If you set stream=False above, the call returns a full completion object
# instead of a stream; read it with:
# print(stream.choices[0].message.content)

JavaScript

import OpenAI from "openai";

// Initialize the OpenAI client with Qubrid base URL
const client = new OpenAI({
  baseURL: "https://platform.qubrid.com/v1",
  apiKey: "QUBRID_API_KEY",
});

// Create a streaming chat completion
const stream = await client.chat.completions.create({
  model: "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
  messages: [
    {
      role: "user",
      content: "Explain quantum computing in simple terms",
    },
  ],
  max_tokens: 16000,
  temperature: 1,
  top_p: 0.95,
  stream: true,
});

// Stream the response chunks as they arrive
for await (const chunk of stream) {
  if (chunk.choices[0]?.delta?.content) {
    process.stdout.write(chunk.choices[0].delta.content);
  }
}
console.log("\n");

// If you set stream: false above, the call resolves to a full completion
// object instead of a stream; read it with:
// console.log(stream.choices[0].message.content);

Go

package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	url := "https://platform.qubrid.com/v1/chat/completions"

	data := map[string]interface{}{
		"model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
		"messages": []map[string]string{
			{
				"role":    "user",
				"content": "Explain quantum computing in simple terms",
			},
		},
		"temperature": 1,
		"max_tokens":  16000,
		"stream":      true,
		"top_p":       0.95,
	}

	jsonData, err := json.Marshal(data)
	if err != nil {
		log.Fatal(err)
	}
	req, err := http.NewRequest("POST", url, bytes.NewBuffer(jsonData))
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Authorization", "Bearer QUBRID_API_KEY")
	req.Header.Set("Content-Type", "application/json")

	client := &http.Client{}
	res, err := client.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()

	// With stream=true the body is a server-sent event stream; each
	// non-empty line is a raw "data: {...}" chunk.
	scanner := bufio.NewScanner(res.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if line != "" {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}

cURL

curl -X POST "https://platform.qubrid.com/v1/chat/completions" \
  -H "Authorization: Bearer QUBRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
  "messages": [
    {
      "role": "user",
      "content": "Explain quantum computing in simple terms"
    }
  ],
  "temperature": 1,
  "max_tokens": 16000,
  "stream": true,
  "top_p": 0.95
}'

Live Example

Prompt: What are the benefits of renewable energy?
Response:
Renewable energy offers a broad set of benefits spanning environmental,
economic, and strategic dimensions:

Environmental
  • Zero direct emissions during operation — solar, wind, and hydro produce
    no CO₂, SOₓ, or NOₓ at the point of generation
  • Reduces dependence on fossil fuel extraction, lowering habitat disruption,
    water contamination risk, and land use pressure
  • Contributes directly to national and international decarbonization targets
    (Paris Agreement, net-zero commitments)

Economic
  • Levelized cost of energy (LCOE) for solar and onshore wind is now lower
    than new coal or gas plants in most markets
  • Creates long-term stable energy pricing — no fuel cost volatility
  • Generates local employment in manufacturing, installation, and maintenance
  • Attracts investment: clean energy is the fastest-growing segment of
    infrastructure capital globally

Energy Security
  • Reduces import dependency on oil, gas, and coal from geopolitically
    unstable regions
  • Distributed generation (rooftop solar, microgrids) improves grid
    resilience against outages and cyberattacks
  • Nations with abundant sun, wind, or hydro can achieve near-complete
    domestic energy independence

Technological Spillovers
  • Advances in battery storage, smart grids, and power electronics driven
    by renewables deployment benefit the broader economy
  • Supports electrification of transport, heating, and industrial processes

Tradeoffs to consider:
  • Intermittency (solar/wind) requires storage or grid balancing solutions
  • Large-scale installations have land and materials requirements
  • Upfront capital costs remain higher for some technologies vs. fossil alternatives

Net assessment: for most economies, renewable energy now represents the most
cost-effective, low-risk long-term energy strategy available.
Try it yourself in the Qubrid AI Playground →

Playground Features

The Qubrid AI Playground lets you interact with Nemotron-3 Super 120B directly in your browser — no setup, no code, no cost to explore.

🧠 System Prompt

Define the model’s role, reasoning mode, and output constraints before the conversation begins. Particularly powerful for agentic pipelines, tool-use workflows, and structured enterprise tasks.
Example: "You are an enterprise IT automation agent. Analyze incoming support
tickets, classify them by severity and category, suggest resolution steps,
and escalate critical issues with a structured JSON summary."
Set your system prompt once in the Qubrid Playground and it applies across every turn of the conversation.
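When calling the API directly, the same system prompt is simply sent as the first message with role `system`. A minimal, SDK-free sketch of the message assembly (`build_messages` is a hypothetical helper name; the resulting list plugs straight into the `messages` field of the quickstart calls above):

```python
# System prompt taken from the ticket-triage example above
SYSTEM_PROMPT = (
    "You are an enterprise IT automation agent. Analyze incoming support "
    "tickets, classify them by severity and category, suggest resolution "
    "steps, and escalate critical issues with a structured JSON summary."
)

def build_messages(ticket):
    """Prepend the system prompt to a user ticket as chat messages."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": ticket},
    ]

print(build_messages("Ticket: VPN gateway unreachable from EU region.")[0]["role"])
# → system
```

Because the system message rides along with every request, it plays the same role as the one-time Playground setting: it constrains every turn of the conversation.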

🎯 Few-Shot Examples

Guide the model’s output structure and reasoning depth with concrete examples — no fine-tuning required. Especially effective for structured outputs and multi-step agentic tasks.
| User Input | Assistant Response |
| --- | --- |
| Ticket: "Server keeps crashing every 12 hours." Priority? | Priority: HIGH. Category: Infrastructure Stability. Suggested action: Check system logs for OOM events, review cron jobs scheduled near crash window, and verify disk I/O health. |
| Summarize this 50-page policy document in 5 bullet points | • Scope: Applies to all employees handling customer PII. • Key requirement: Data must be encrypted at rest and in transit. • Breach protocol: Notify DPO within 72 hours. • Retention: 7-year maximum. • Non-compliance: Subject to disciplinary review. |
💡 Stack multiple few-shot examples in the Qubrid Playground to shape agentic behavior, output schema, and reasoning verbosity — no fine-tuning required.
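Over the API, few-shot examples are encoded as alternating user/assistant turns placed before the real query. A small sketch using the first pair from the table above (`with_few_shot` is an illustrative helper, not a Qubrid API):

```python
# Few-shot pair taken from the table above, encoded as prior turns
FEW_SHOT = [
    ('Ticket: "Server keeps crashing every 12 hours." Priority?',
     "Priority: HIGH. Category: Infrastructure Stability. Suggested action: "
     "Check system logs for OOM events, review cron jobs scheduled near "
     "crash window, and verify disk I/O health."),
]

def with_few_shot(query, examples=FEW_SHOT):
    """Turn (user, assistant) example pairs into chat history before the query."""
    messages = []
    for user_msg, assistant_msg in examples:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": query})
    return messages

print(len(with_few_shot('Ticket: "Login page returns 502." Priority?')))
# → 3
```

Each additional example pair adds two messages of history; the model then tends to mirror the demonstrated format and depth in its reply to the final user message.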

Inference Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| Streaming | boolean | true | Enable streaming responses for real-time output |
| Temperature | number | 1 | Controls randomness in output. Recommended: 1.0 for all tasks |
| Max Tokens | number | 16000 | Maximum number of tokens to generate |
| Top P | number | 0.95 | Nucleus sampling threshold. Recommended: 0.95 for all tasks |

Use Cases

  1. Agentic workflows and multi-agent collaboration
  2. Long-context reasoning (up to 1M tokens)
  3. IT ticket automation and high-volume enterprise workloads
  4. Complex tool use and multi-step function calling
  5. RAG (Retrieval-Augmented Generation)
  6. Software engineering and cybersecurity triaging
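For the RAG use case, retrieved passages are typically concatenated into the prompt ahead of the question. A minimal, framework-free sketch of that assembly step (the helper name, citation format, and character budget are all illustrative choices, not a prescribed Qubrid pattern):

```python
def build_rag_prompt(question, passages, max_chars=12000):
    """Concatenate retrieved passages into a grounded prompt, trimming to budget."""
    context, used = [], 0
    for i, passage in enumerate(passages, 1):
        entry = f"[{i}] {passage}"
        if used + len(entry) > max_chars:
            break  # stop once the context budget is exhausted
        context.append(entry)
        used += len(entry)
    return (
        "Answer using only the sources below. Cite sources as [n].\n\n"
        + "\n".join(context)
        + f"\n\nQuestion: {question}"
    )

prompt = build_rag_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 30 days.", "Shipping takes 5 days."],
)
print("[1]" in prompt and "[2]" in prompt)
# → True
```

The resulting string goes into a single user message; with cached input pricing, keeping the source block stable across requests lets repeated retrievals hit the $0.04/1M cached rate.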

Strengths & Limitations

| Strengths | Limitations |
| --- | --- |
| LatentMoE: 512 experts / 22 active per token at the same compute cost as a standard MoE | Requires a minimum of 2× H100-80GB GPUs for local deployment |
| 2.2x throughput vs GPT-OSS-120B; 7.5x vs Qwen3.5-122B | Thinking mode adds latency overhead; low-effort mode recommended for simple queries |
| 60.47% SWE-Bench Verified; 83.73% MMLU-Pro; 79.23% GPQA | Not optimized for vision or multimodal inputs |
| Native 1M-token context: 91.75% on RULER @ 1M | Function calling supported but may need prompt engineering for complex schemas |
| MTP speculative decoding: 3.45 avg acceptance length (up to 3x wall-clock speedup) | |
| Configurable reasoning mode via enable_thinking=True/False | |
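The reasoning mode is toggled with the `enable_thinking` flag mentioned above. Exactly where that flag lives in the request depends on the serving stack; the sketch below assumes it is accepted as a top-level field in the chat-completions body (treat that placement as an assumption, not a confirmed API detail):

```python
import json

def chat_payload(prompt, enable_thinking=True):
    """Build a chat-completions request body with the (assumed) thinking toggle."""
    return {
        "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1,
        "top_p": 0.95,
        # Assumption: the serving stack accepts enable_thinking at the top level;
        # with the OpenAI SDK this would go through the extra_body parameter.
        "enable_thinking": enable_thinking,
    }

body = json.dumps(chat_payload("Plan a database migration.", enable_thinking=False))
print("enable_thinking" in body)
# → True
```

Given the latency overhead noted in the table, disabling thinking for simple queries and enabling it only for multi-step reasoning is the natural default policy.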

Why Qubrid AI?

  • 🚀 No infrastructure setup — 120B MoE served serverlessly, pay only for what you use
  • 🔁 OpenAI-compatible — drop-in replacement using the same SDK, just swap the base URL
  • 💰 Cached input pricing — $0.04/1M for cached tokens, critical for long-context and repeated RAG workloads
  • Throughput-optimized — Nemotron’s 2.2x speed advantage is fully realized on Qubrid’s low-latency infrastructure
  • 🧪 Built-in Playground — prototype with system prompts and few-shot examples instantly at platform.qubrid.com
  • 📊 Full observability — API logs and usage tracking built into the Qubrid dashboard


Built with ❤️ by Qubrid AI

Frontier models. Serverless infrastructure. Zero friction.